Fiber #13

Open
daandemeyer wants to merge 65 commits into main from fiber

Conversation

@daandemeyer
Owner

No description provided.

keszybz and others added 30 commits April 29, 2026 11:23
As is often the case, in this case because of alignment, we are actually
not saving any space. With the bitfield we are using one bit of the 8 bytes
allocated, and without the bitfield we are using 8 bits of that.

But we're paying a price in generated code, at every access site to the
field:

$ diff <(objdump -S build/libsystemd.so.old) <(objdump -S build/libsystemd.so.new)
...
       v->protocol_upgrade = false;
-   fa2d2:	48 8b 45 a8          	mov    -0x58(%rbp),%rax
-   fa2d6:	0f b6 90 90 01 00 00 	movzbl 0x190(%rax),%edx
-   fa2dd:	83 e2 fe             	and    $0xfffffffe,%edx
-   fa2e0:	88 90 90 01 00 00    	mov    %dl,0x190(%rax)
+   fa2a9:	48 8b 45 a8          	mov    -0x58(%rbp),%rax
+   fa2ad:	c6 80 90 01 00 00 00 	movb   $0x0,0x190(%rax)
struct sd_varlink:
- /* size: 448, cachelines: 7, members: 21 */
+ /* size: 432, cachelines: 7, members: 21 */

struct sd_varlink_server:
- /* size: 160, cachelines: 3, members: 21 */
+ /* size: 152, cachelines: 3, members: 21 */
The intent was good, but we now print two or three of those messages
for each metrics report received on the wire. If the json object is
extensible, then it's all good and we don't need to inundate the user
with this trivial information. (And the message also sounds like
something is wrong or unexpected, when it totally isn't.)

...
(string):1:73: Unrecognized object field 'object', assuming extension.
(string):1:89: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"virbr0","value":"degraded-carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:83: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"lo","value":"carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:79: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"wlp0s20f3","value":"carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:86: Unrecognized object field 'value', assuming extension.
...
Merge the two blocks adding tests, since there seems to be
no obvious reason to have two separate blocks, as they both
contain tests from the same libraries.
Generic Varlink API for services that hand out file descriptors to
storage volumes. Three methods: Acquire() returns an fd for a named
volume (optionally creating it from a template), ListVolumes()
enumerates available volumes, ListTemplates() enumerates supported
creation templates. Volume types follow kernel inode-type naming:
blk (block device), reg (regular file), dir (directory).

Intent is that multiple providers can sit behind AF_UNIX sockets in a
well-known directory and be consumed uniformly by nspawn, vmspawn,
the service manager (BindVolume=) and similar tools.
First implementation of io.systemd.StorageProvider, exposing all block
devices known to udev (disks, partitions, dm nodes, …) as volumes of
type "blk". Names are picked from stable /dev/mapper and /dev/disk/by-*
symlinks; content-derived identifiers (by-uuid, by-label, …) are
intentionally avoided for security. Volume creation is not supported by
this backend.

Socket-activated via /run/systemd/io.systemd.StorageProvider/block.
Also adds shared storage-util.[ch] (VolumeType / CreateMode helpers)
that subsequent providers reuse.
Second StorageProvider implementation, exposing regular files and
directories from a backing filesystem. In system mode the backing
directory is /var/lib/storage/, in user mode $XDG_STATE_HOME/storage/;
entries with a .volume suffix are exposed, with the inode type
determining whether the volume is reported as reg, dir or (via
symlinked/bind-mounted device node) blk.

Unlike the block provider, this one supports creating volumes
on-demand from a small set of built-in templates: sparse-file,
allocated-file, directory and subvolume.
CLI for inspecting and using storage providers. Scans
/run/systemd/io.systemd.StorageProvider/ (or the user-mode equivalent)
for AF_UNIX sockets and talks to each one over Varlink. Verbs:
"volumes" lists volumes across all providers, "templates" lists
supported creation templates, "providers" lists the endpoints
themselves.

Also installed as a mount.storage helper, so
'mount -t storage PROVIDER:VOLUME /mnt' (or 'mount -t storage.<fstype>'
to put a fresh filesystem on a block volume) acquires the volume and
mounts it. Ships with bash/zsh completions and a man page.
VM-only test that exercises both shipped providers through storagectl:
verifies the well-known sockets exist, lists providers/volumes/
templates, creates and acquires volumes from each template
(sparse-file, allocated-file, directory, subvolume), attaches a loop
device to cover the block provider, and exercises the mount.storage
helper.
Records the still-missing StorageProvider integrations (nspawn,
vmspawn, service-manager BindVolume=) and replaces the now-obsolete
generic "storage API via varlink" entry with a NetworkProvider
proposal modelled on it.
So strv_push_with_size() doesn't have to recalculate the size every
time.
The PCR to measure into is closely associated with where we place a
resource in the initrd cpios. Hence, let's also track it in CpioTarget,
simplifying our function parameter lists.

No change in behaviour.
This loads the new 'extra' stanza, but doesn't actually do anything with
it yet. That's added in a later commit.

Replaces: systemd#39286

Implements: uapi-group/specifications#212
This generates on-the-fly cpio initrds from 'extra' resources declared
in Type #1 entries and installs them via the Linux initrd protocol so
that they get passed to the Linux kernel.

Replaces: systemd#39286
Verb dispatch is left untouched for now.

Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixup for 8623980. This didn't
cause any problems until the conversion away from getopt_long().
--timeout-signal is now documented (fixup for
e209926).

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
keszybz and others added 29 commits April 29, 2026 15:03
…pe 1) (systemd#41863)

This implements the "extra" stanza for type 1 entries in systemd-boot,
see:


uapi-group/specifications@bde167a

It comes with a really thorough test suite matching our current level
of testing of systemd-boot (read: there is none; I ask you to trust me,
Claude, and your review on this one)...

Split out of systemd#41543
option_parser_next_arg() is renamed to option_parser_peek_next_arg()
to match option_parser_consume_next_arg().

A new helper, option_parser_get_arg(…, n), is added. It is a common
pattern to only need a single arg, and getting an array and extracting
a single item from it is too verbose.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
…to be used by vmspawn/nspawn/pid1 to provide storage volumes in a generic fashion (systemd#41776)

BindPath= in unit files, and --bind= in nspawn/vmspawn doesn't really
cut it to connect arbitrary storage infra to it. Let's do something
about it, and implement a simple, light-weight API for acquiring an fd
to a storage volume. Benefits:

1. the interface can be implemented by anyone, connecting anything to
vmspawn/nspawn/service management
2. very loose coupling: just bind a socket into a well-known dir, done
3. mounting can happen on-demand
This addresses some trivial points made by @keszybz in the PR review.
This is mostly stuff discussed in systemd#41776.
So strv_push_with_size() doesn't have to recalculate the size every
time.
…temd#41869)

FOREACH_ARRAY declares 'i' as the iterator but the body passed 'd' (the
array base) to block_device_done(). Since mfree() leaves the field NULL
after the first call, element 0 is freed repeatedly while elements
1..N-1 leak their node, symlinks strv, model, vendor and subsystem.

The bug predates the sanitizer-instrumented callers. PR systemd#41776's new
systemd-storage-block daemon runs blockdev_list() under ASan/LSan in
TEST-87-AUX-UTILS-VM and exposes it (15 allocs / 804 bytes leaked per
ListVolumes request). The fix also benefits repart and blockdev_list's
internal CLEANUP_ARRAY cleanup.

Follow-up for 9f6b274
Follow-up for 6b1324f

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
When /proc is bind-mounted read-only (common in mock/Koji buildroots,
containers, and other sandboxed environments), opening
/proc/sys/fs/binfmt_misc returns ELOOP if it is an automount point
that cannot be triggered in the read-only context.

Currently binfmt_mounted_and_writable() only handles ENOENT, so ELOOP
propagates as an error. This causes test-binfmt-util to fail with
SIGABRT and disable_binfmt() to log a spurious warning at shutdown.

Treat ELOOP and EACCES the same as ENOENT: binfmt_misc is not usably
available, return false.

Note: PR systemd#37006 (merged April 2025) addressed ELOOP in the xstatfsat()
path, but the open() call in binfmt_mounted_and_writable() remained
unhandled.

Fixes systemd#38070
If they are read-only they are no candidates, since we cannot write to
them.
Small hygiene fix. r must be >= 0 as per the prior statement (otherwise we would have returned). This is really only going to be r == 0, which means "return r;" is "return 0;". I'm updating this to use log_debug_errno.
…CLEAR_FUNC()

DEFINE_POINTER_ARRAY_CLEAR_FUNC() generates a helper of the form
helper_array_clear(T *array, size_t n) that drops each element but does
not free the array itself, parallel to DEFINE_POINTER_ARRAY_FREE_FUNC()
for cases where the array has automatic storage duration.

CLEANUP_ELEMENTS() pairs with these helpers to provide a _cleanup_-like
attribute for fixed-size arrays: the bound is taken from ELEMENTSOF(),
and the helper is invoked across the elements at scope exit. Compared to
CLEANUP_ARRAY(), the storage is neither freed nor zeroed.

Migrate various logic across the tree over to the new macros.

sd-device: use DEFINE_POINTER_ARRAY_CLEAR_FUNC() for sd_device_unref_array_clear()

Replace the local device_unref_many() helper with the macro-generated
equivalent.

format-table: switch help-table arrays to CLEANUP_ELEMENTS()

Generate table_unref_array_clear() via DEFINE_POINTER_ARRAY_CLEAR_FUNC()
and convert the help-table arrays in bootctl, cryptenroll, nspawn,
repart and vmspawn to CLEANUP_ELEMENTS(). The arrays no longer need a
trailing NULL slot, so the size matches ELEMENTSOF() of the groups
array.

firewall-util: switch netlink message arrays to CLEANUP_ELEMENTS()

Generate sd_netlink_message_unref_array_clear() via
DEFINE_POINTER_ARRAY_CLEAR_FUNC() in place of the NULL-terminated
sd_netlink_message_unref_many(), and convert the two stack arrays of
sd_netlink_message pointers to CLEANUP_ELEMENTS().
Let's cap the number of questions each query can have to something
reasonable - 128 questions per query should be more than enough for any
real-world scenario.
Let's start with 1024, as that should be plenty for all sane use cases.
I am really not a fan of full code lines passed to macros as parameters.
Let's get rid of the 3rd parameter of FOREACH_OPTION() hence:

1. Let's return errors just as a regular value (though a negative one),
   that can be handled via a OPTION_ERROR case statement for the switch.
   This normalizes handling of the error, just like any other event
   returned by the option parser.

2. In order to avoid exploding the amount of boilerplate in each use
   (that just propagates the error on OPTION_ERROR), let's then
   introduce an explicit FOREACH_OPTION_OR_RETURN(), that returns from
   the calling function on its own (and makes that clear in the name).

Together this cleans up, normalizes the logic and shortens the code.
ucontext_t and the makecontext()/swapcontext() family are required by
upcoming fiber support, but musl deliberately does not ship them in
libc. libucontext provides standalone implementations of these and is
the canonical replacement on musl-based distributions.

Wire libucontext up as an optional dependency, required when
building against musl (where it's mandatory) and opt-in elsewhere.

When libucontext is built in freestanding mode (as it typically is on
glibc-based distributions that ship it), <libucontext/libucontext.h>
collides with <sys/ucontext.h> over REG_R8 and friends, and we can't
simply avoid <sys/ucontext.h> because <signal.h> pulls it in
unconditionally. Add override headers under src/include/override/ that
either forward to libucontext's header (and alias the makecontext
family to its libucontext_-prefixed counterparts) or fall through to
the system header via #include_next, depending on whether libucontext
is enabled.

The musl CI workflow gains libucontext-dev so the upcoming fiber code
compiles there.
Traditionally, asynchronous programming in systemd has been achieved using
sd-event along with the asynchronous interfaces of sd-bus and sd-varlink.
This works well when the system is reacting to events and all code triggered
by those events can run without blocking. In these scenarios, the global
Manager object is passed as userdata to the callback, and the callback can
use the stack as usual, declaring local state and ensuring proper cleanup via
_cleanup_. Control flow structures, such as loops, work as expected, and
everything runs smoothly.

However, challenges arise when the code needs to perform long-running
operations within these callbacks. Since the system cannot block execution
within the callback, we can't directly invoke a long-running operation and
wait for its result without introducing complexities. Instead, we need to
initiate the long-running task, register for completion with sd-event,
sd-bus, or sd-varlink, and provide a callback to be invoked when the
operation completes.

This callback, however, only receives a single userdata pointer, which
forces us to bundle all local variables into a struct and pass it along as
part of the callback. On top of that, after queuing the asynchronous
operation, the caller continues executing. As the caller's stack unwinds
when the function exits, the resources and state within the local scope may
be prematurely cleaned up. Therefore, the struct must store copies of the
local variables or ensure proper reference counting to prevent premature
resource cleanup.

When multiple long-running operations need to be initiated within a loop,
the complexity grows further. We must introduce additional shared state to
track the completion of all operations before we can run any code that
depends on their results.

Furthermore, since the daemon may be shut down at any time, we must track
the lifecycle of each long-running operation in the global Manager struct,
ensuring proper cleanup even when stack unwinding can no longer manage the
resources for us.

Fibers, or green threads, provide a more natural way of handling
asynchronous operations. By enabling cooperative multitasking within a
single thread, fibers allow us to write code that looks like it’s running
synchronously, but with the ability to yield control at predefined points,
such as when waiting for long-running tasks to complete.

With fibers, we can simplify the control flow by running asynchronous
operations within a fiber, allowing us to "pause" execution while waiting
for the long-running operation to finish and then "resume" the operation once
it's complete. This eliminates the need for multiple callback chains,
extensive state tracking, and the potential pitfalls of stack unwinding.

This commit introduces the ability to execute long-running operations in a
non-blocking manner while maintaining the simplicity and readability of
synchronous code. The fiber-based approach will significantly improve the
handling of complex workflows, making the code easier to write and maintain.

The implementation is based on ucontext.h and sd-event. ucontext.h provides
us with alternate stacks that we can switch between. The default stack size
is the same as for a regular thread. Because we use mmap() to allocate the
stack, the memory is not used until it is paged in by the kernel, so we
don't actually consume 8MB per fiber.

To integrate fibers with the event loop, each fiber is assigned a deferred
event source which resumes the fiber when enabled. The deferred event source
is oneshot by default so the fiber will run immediately until it yields or
suspends. If it yields, the deferred event source is enabled again (oneshot)
immediately. If it suspends, before it suspends, one or more event sources
are registered with sd-event that will enable the deferred event source
(oneshot) to resume the fiber once the operation it is waiting for completes.

Yielding or suspending the fiber is done by calling sd_fiber_yield() or
sd_fiber_suspend() respectively. Both of these return zero on success or any
error value from the async operation that caused the fiber to resume.

This is also how fiber cancellation is implemented. When a fiber is cancelled,
sd_fiber_yield() and sd_fiber_suspend() will return ECANCELED when the fiber
is resumed, allowing the fiber to unwind its stack (which allows cleanup to
happen automatically) and finish.

Instead of having applications work directly with fibers, we hide them behind
a generic futures interface to represent long-running operations, regardless of
whether those operations are running on a fiber or not. Aside from fibers, the
futures library (sd-future) allows waiting for sd-event sources and doing sd-bus
calls in the background as well. Fibers can suspend until a future is ready with
sd_fiber_await().

The futures library has two sides. sd_future is the read side that consumers
hold and inspect; sd_promise is the embedded write side that producers use to
resolve it. The two share storage — the promise is a member of the future —
and producers recover the wrapping future from a promise via container_of().

Each future kind plugs into the library by providing an sd_future_ops vtable
(free, cancel, set_priority) and an opaque implementation struct via
sd_future_new(). The library treats the impl as a black box; the only
constraint is that its first field must be sd_promise*, which sd_future_new()
stamps with a back-pointer to the wrapping future. This lets handlers (e.g.
an sd-event IO callback) resolve the future from just the impl pointer
without having to keep a separate sd_future* around, and keeps producers
small — the IO future, time future, and bus-call future each fit in roughly
fifty lines.

A future starts in SD_FUTURE_PENDING and transitions exactly once to
SD_FUTURE_RESOLVED, carrying an integer result. Consumers can react to that
transition either by installing a one-shot callback with
sd_future_set_callback() (callback-style code) or by waiting on it from a
fiber via sd_fiber_await() (synchronous-looking fiber code). sd_fiber_await()
is itself built on a "wait future" that resolves when its target resolves;
sd_future_new_wait() exposes the same primitive directly so non-fiber callers
can chain futures without involving a fiber.

Cancellation is cooperative: sd_future_cancel() invokes the impl's cancel
callback, which is responsible for tearing down its work and ultimately
resolving the promise with -ECANCELED. For fiber futures this is what
surfaces as the ECANCELED return from sd_fiber_yield()/sd_fiber_suspend()
mentioned above.

Fire-and-forget fibers — created by passing a NULL ret to sd_fiber_new() —
take a self-reference on their future so they outlive the caller's scope.
The self-ref is dropped when the fiber resolves. This floating mechanism
(sd_fiber_set_floating()) is restricted to fiber futures because they
uniquely guarantee resolution; allowing it for arbitrary future kinds would
risk silent leaks for kinds that may never resolve.

Note that fiber cleanup depends on the runtime operating normally. Each
fiber's _cleanup_-style cleanups live on the fiber's own stack and run
only when the fiber is resumed and allowed to unwind, which requires a
working event loop to drive it to completion. The exit event source
registered for top-level fibers ensures unwind on a normal sd_event_exit(),
but if the event loop itself terminates abnormally (e.g. an unrecoverable
allocation failure mid-dispatch) before all fibers have resolved, their
stacks never unwind and any resources they own leak. This is a structural
property of stackful coroutines, shared with libraries like Boost.Coroutine
and libdill; for resources where leaking is unacceptable, callers must
arrange explicit teardown rather than relying solely on fiber-stack cleanup.

The code lives in libsystemd as sd-future (not exported) for the following reasons:
- We may want to make this a public libsystemd API in the future
- The code can't live in src/basic as it makes heavy use of sd-event
- The code can't live in src/shared as sd-bus and sd-event make use of it

The basic fiber definitions do live in src/basic as we need them in log-context.c
and log.c to give each fiber its own log context instead of every fiber operating
on the thread global log context.
Add a family of sd_fiber_*() I/O wrappers that, when called from a
fiber, behave like blocking I/O from the caller's perspective but
yield to the event loop instead of blocking the thread:

  sd_fiber_read / sd_fiber_write
  sd_fiber_readv / sd_fiber_writev
  sd_fiber_recv / sd_fiber_send
  sd_fiber_connect
  sd_fiber_recvmsg / sd_fiber_sendmsg
  sd_fiber_recvfrom / sd_fiber_sendto
  sd_fiber_accept
  sd_fiber_poll

All of them share a single helper, fiber_io_operation(), which when
invoked outside a fiber falls through to the underlying syscall
directly, preserving the regular blocking behaviour. Inside a fiber
the helper flips the fd to non-blocking (restoring its original mode
on return), tries the syscall once on the fast path, and on EAGAIN/
EWOULDBLOCK creates an sd-event-backed IO future via future_new_io(),
suspends the fiber, and retries the syscall once the event source
fires. Errors propagate as negative errno values, matching the
convention of other sd-* APIs.

future_new_io() itself is added to sd-event/event-future.{c,h} as a
new IoFuture kind. It wraps sd_event_add_io() into an sd_future:
oneshot enable, EPOLLERR translated via SO_ERROR (suppressed for
non-sockets), and the fd duplicated with F_DUPFD_CLOEXEC to avoid
EEXIST when multiple sources watch the same descriptor. Cancellation
disables the source and resolves the promise with -ECANCELED. It's
the same pattern as the time and child future kinds added in the
previous commit.

Together these let fiber-using code write straight-line socket and
pipe I/O without bundling state into callbacks. Tests covering the
fast path, suspend-and-retry path, fallback-when-not-on-a-fiber path,
cancellation while suspended, blocking-mode preservation, and shared
fd / multiple-fiber scenarios live in test-fiber-io.c.
Some helpers in src/basic — ppoll_usec_full() (used by fd_wait_for_event()),
loop_read(), loop_read_exact(), loop_write_full() and
pidref_wait_for_terminate_full() — block the calling thread. That's the
right behaviour outside a fiber but not inside one, where blocking the
thread also stalls every other fiber running on the same event loop.
Rewriting every caller to pick a fiber or non-fiber variant explicitly
would be a lot of churn and would split otherwise-shared code paths in
two.

Instead, the helpers detect at runtime whether they're running on a fiber
and dispatch to a suspending variant when they are. FiberOps in
fiber-def.h holds five function pointers (ppoll, read, write, timeout,
timeout_done); each Fiber stores a pointer to a const FiberOps that
sd_fiber_new() populates with sd_fiber_poll/sd_fiber_read/sd_fiber_write/
sd_fiber_timeout/sd_future_unref so the suspending implementations
themselves stay in libsystemd. FIBER_OPS_FORWARD() temporarily clears the
ops pointer around the dispatched call so the op's body can reuse the
non-redirected helpers without recursing.

- ppoll_usec_full() uses FIBER_OPS_FORWARD() at the top to tail-call the
  ppoll hook when on a fiber, otherwise falls through to the normal
  ppoll() body. ss must be NULL on a fiber since sd_fiber_poll() doesn't
  take a sigmask.

- loop_read()/loop_read_exact() call the read hook directly when on a
  fiber and the fd is blocking, which suspends on EAGAIN until data is
  available — making the do_poll knob and the explicit
  fd_wait_for_event() retry loop unnecessary in that path. When the fd
  is already non-blocking and do_poll is false the original
  return-EAGAIN-immediately semantic is preserved by falling through to
  the read() path.

- loop_write_full() likewise calls the write hook inside a
  FIBER_OPS_WITH_TIMEOUT() scope so the caller's timeout is honoured via
  a deadline future, mirroring SD_FIBER_TIMEOUT() but reachable from
  src/basic without pulling in sd-future.h. The timeout==0
  fast-return-EAGAIN semantic is preserved the same way.

- pidref_wait_for_terminate_full() polls the pidfd via fd_wait_for_event()
  before each waitid() when either a finite timeout is set or we're on a
  fiber, and requires pidref->fd >= 0 in those cases (returning
  -ENOMEDIUM otherwise — extending the rule that already applied to
  finite timeouts). The poll suspends the fiber via the ppoll hook above;
  the subsequent waitid() doesn't block because the pidfd is already
  signalled.

Tests in test-fiber-ops.c cover the suspending paths for these helpers,
the cooperative-scheduling ordering they enable across multiple fibers,
and the fall-through-to-blocking behaviour when called outside any
fiber.
…iber

sd_event_run() blocks the calling thread on the event loop's epoll fd
until something happens. When the caller is a fiber, that's the wrong
behaviour: blocking the thread also stalls every other fiber and the
outer event loop driving them. The most common way to hit this is a
fiber that creates its own inner event loop (e.g. a server-style fiber
that wants to dispatch its own sources independently of whatever loop
the test or supervising fiber is running on) — with the existing
implementation the inner sd_event_run() would hold the thread while the
outer scheduler should be free to advance other fibers.

Add an event_run_suspend() variant in sd-event/event-future.c that
performs the same prepare/wait/dispatch dance, but when the fast path
finds nothing ready it (a) creates an IO future watching the inner
event loop's epoll fd on the *outer* event loop, (b) optionally creates
a time future for the timeout, and (c) suspends the fiber. When either
future fires the fiber is resumed and the prepare/wait/dispatch sequence
runs once more to actually dispatch what's pending. sd_event_run()
checks sd_fiber_is_running() and delegates to this variant when on a
fiber; profile_delays accounting is intentionally skipped on that path
since the underlying prepare/wait/dispatch primitives already account
for themselves.

PROTECT_EVENT() moves from sd-event.c into a new event-util.h so it can
be reused by event_run_suspend() without exporting it as a libsystemd
symbol. test-event-future.c covers the suspending paths: zero-timeout
fast return, immediately-pending IO, IO arriving during suspension,
timer firing during suspension, repeated short-timeout calls (the
post-error SD_EVENT_ARMED state regression), and a nested fiber-driven
inner event loop running concurrently with an outer timer.
Three changes to teach sd-bus how to behave when called from a fiber, in
order of increasing depth:

1. bus_poll() now uses sd_fiber_ppoll() instead of ppoll_usec(). On the
   non-fiber path that's a transparent fall-through; on a fiber it
   suspends instead of blocking the thread, so other fibers and the
   surrounding event loop keep running while the bus waits for I/O.

2. sd_bus_call() now redirects to a new bus_call_suspend() helper when
   the caller is a fiber whose event loop is the same one the bus is
   attached to. The plain bus_poll() path serializes all bus traffic on
   the slot's reply (only one method call can be in flight per
   sd_bus*), which would defeat the point of running multiple fibers
   against one bus. bus_call_suspend() builds on the async sd-bus API:
   it wraps the call in a new BusFuture (sd-bus/bus-future.{c,h}) that
   resolves when the reply or method-error arrives, lets the fiber
   await that future, and surfaces the reply to the caller via
   future_get_bus_reply(). Because the futures live on the event loop
   rather than a per-bus slot, multiple fibers can drive concurrent
   method calls against the same bus.

3. A new private SD_BUS_VTABLE_METHOD_FIBER flag dispatches a vtable
   method handler on its own fiber, so handlers are free to use
   sd_bus_call() against the same bus, sd_fiber_sleep(), loop_read(),
   etc. without stalling the event loop for other connections or
   handlers. The flag stays out of sd-bus-vtable.h (its bit value is
   reserved there to prevent collisions) — the fiber runtime is a
   systemd-internal implementation detail.

Lifecycle of fiber-dispatched handlers is tracked on the bus itself: a
new bus->fiber_futures set holds a ref to each in-flight handler.
bus_enter_closing() cancels every entry and process_closing() returns
with the bus still in CLOSING state until the set drains, so we can be
sure no fiber handler outlives the bus. bus_fiber_resolved() removes
the entry on completion. bus_free()'s assert(set_isempty()) makes the
invariant load-bearing.

To exercise these changes the existing thread-based client/server
sd-bus tests (test-bus-chat, test-bus-objects, test-bus-peersockaddr,
test-bus-server, test-bus-watch-bind) are migrated to fibers, and a
new test-bus-fiber is added that covers SD_BUS_VTABLE_METHOD_FIBER —
including handlers that issue nested sd_bus_call() on the same bus, the
cancel-on-close path, and concurrent dispatches across multiple fibers.
Three changes, in increasing depth:

1. json_stream_wait() now uses sd_fiber_ppoll() instead of
   ppoll_usec(). On the non-fiber path that's a transparent
   fall-through; on a fiber it suspends instead of blocking the thread.
   Because all of varlink's synchronous client paths
   (sd_varlink_wait(), sd_varlink_call(), sd_varlink_collect()) drive
   their I/O through json_stream_wait(), this change alone makes them
   safe to call from a fiber.

2. Add varlink_server_bind_fiber() and varlink_server_bind_fiber_many()
   in varlink-util.{c,h} for registering a method handler that should
   run on a dedicated fiber per dispatch. The fiber-bound methods live
   in a separate s->fiber_methods map alongside the regular s->methods;
   bind_internal()/bind_many_internal() are factored out so the regular
   and fiber bind variants share their parsing/insertion code.
   Registering the same method in both maps is rejected because the
   dispatcher consults the regular map first and would otherwise
   silently shadow the fiber binding.

3. varlink_dispatch_fiber() builds a VarlinkFiberData (refs to the
   connection, parameters, and method name), spawns a fiber via
   sd_fiber_new(), and makes the future floating so the fiber
   self-manages its lifetime — neither the dispatcher nor the
   connection has to track it. The fiber's priority is set to one
   below the connection's quit event source so that on graceful
   shutdown the fiber's exit handler fires (and runs its cleanup)
   before varlink's quit_callback() closes the connection underneath
   it; this is what lets a fiber-bound handler reply or flush its
   sentinel on a still-open connection during shutdown.

The connection state transitions are reordered so they happen before
the fiber spawn rather than after the synchronous callback returns:
the fiber runs after dispatch has already moved past PROCESSING, which
matches the behaviour expected for a deferred reply (the fiber may
either reply immediately, or stash the connection and reply later, in
which case the post-callback logic treats it as a PENDING_METHOD).

The client/server varlink tests are migrated to fibers (threads → mock
server fibers on the same event loop) to exercise the new paths.

The synchronous qmp_client_call() pumps the event loop until its reply
arrives, pinning the parsed reply on c->current so it can hand out
borrowed pointers to the caller. That model only fits one in-flight
sync call: a second qmp_client_call() on the same client clears
c->current before issuing its own send, invalidating the first
caller's borrowed pointers. On a single-threaded event loop that was
fine, but with fibers two concurrent calls on the same client can
interleave through the pump (json_stream_wait() suspends the running
fiber) and trample each other.

Add three entry points:

- qmp_client_call_future(): the async building block. Returns an
  sd_future backed by a QmpFuture impl that owns the reply variant and
  a strdup'd error_desc. The reply callback resolves the promise;
  cancellation resolves the promise with -ECANCELED and drops the pending
  slot so a late reply doesn't fire into freed memory.

- future_get_qmp_reply(): borrowed-lifetime extraction from a resolved
  future. The pointers stay valid until the future is freed.

- qmp_client_call_suspend(): the convenience wrapper for fibers. Issues
  the call via qmp_client_call_future(), suspends the fiber, then
  surfaces result and error_desc through the same borrow contract as
  qmp_client_call(): valid until the next qmp_client_call*() on this
  client. The contract is implemented by pinning the resolved future
  on the client (current_call_future) and unref'ing the previous one
  on entry. Because the per-call state lives on the future rather than
  on a single c->current slot, multiple fibers can have their own
  in-flight calls on the same client without clobbering each other.

Then make qmp_client_call() detect when it's running on a fiber whose
event loop matches the client and transparently delegate to
qmp_client_call_suspend(), so existing call sites become safe under
concurrent fibers without source changes.

To make this work concurrently, we also change qmp_client_call() to hand
out references and copies of errors, so the borrowed pointers it hands
out no longer need to be stored in the QmpClient struct.

The mock servers used to be driven out-of-band: each test created a
socketpair, forked a child, ran a hand-coded request/response script
against the raw fd, and sent SIGTERM to tear it down. That worked but
required pidref/process-util/signal plumbing in every test, two
distinct execution contexts that couldn't share state, and a JsonStream
attached to the mock side that pretended to be event-loop-driven while
actually being driven manually via blocking reads.

Now that JsonStream exposes suspending helpers, the mocks can live
inside the same process and event loop as the client. Each mock is
rewritten as an sd-fiber that runs alongside the client fiber: the
JsonStream uses the suspending json_stream_wait()/flush() variants,
so the mock fiber yields on I/O and the event loop schedules the
client in the meantime. Both sides progress cooperatively, no
fork/SIGTERM/PID tracking, no manual phase tracking.

Two cleanups fall out of the rewrite:

- A QMP_TEST(name, mock_fn) { ... } macro encapsulates the per-test
  scaffolding (event loop, socketpair, mock fiber spawn, exit-on-idle
  shim) and injects an already-connected QmpClient *client into the
  test body. Each test now reads as a flat sequence of
  qmp_client_call() invocations against that client.

- Repeated mock command/reply scripting is factored into
  mock_qmp_expect(), mock_qmp_reply(), mock_qmp_expect_and_reply(),
  mock_qmp_handshake(), and mock_qmp_query_status_running(). The
  greeting JSON is built with sd_json_buildo() instead of being parsed
  from a literal.

The file shrinks from 756 to 494 lines, mostly through deletions.